Abstract- Gene chips, which identify the individual genes in a given
biological sample, are typical of the next generation of biological
data sources. These data are currently stored in a database format
defined by the Genetic Analysis Technology Consortium (GATC). To
interpret the chip data, we also need information about the genes
themselves, as found in the Human Genome Database (HGDB). These two
databases were conceived at different times to serve different
purposes, and their designs differ significantly. Extracting
information simultaneously from multiple databases has proved to be a
very difficult problem.

We have developed a system that will intelligently direct a single
client query against a federation of databases. Our solution uses
software standards common in the field today (XML, CORBA, and Java),
but these standards by themselves are not sufficient. We have
developed a new component called the Class Mapper, a software layer
unique to each database. Each Class Mapper represents its database as
an object-oriented schema consistent with the schema level of the
federation. A Federation Platform reads the query, the Class Mappers
execute the query across their respective databases, and the
Federation Platform returns results to the client.

I. Introduction

Recent advances in biology have produced an extraordinary number and
variety of data sources that must be used together to address problems
in medicine and biology [1-3]. Among the most visible are the data
sources that capture our understanding of the human genome. Prominent
among these is the Human Genome Database (HGDB) [4], which was designed
to be a central repository for accumulating and disseminating genetic
data. It contains three main types of information: regions of the
human genome; maps of the human genome; and variations including
mutations and polymorphisms. The size of the HGDB is more than a
terabyte.

A second class of data sources comes from experiments. A
representative of these data is the GATC format developed by the
Genetic Analysis Technology Consortium [5] to capture the information
in gene chips that contain 10^5 or more individual reaction wells,
each of which represents an individual genetic test. A single gene
chip entry in a GATC database would contain intensity information for
each individual gene assayed by the chip as well as metadata giving
the details of the experimental conditions and the analysis protocols.

We have taken the HGDB and GATC databases as representative of the
many data sources that must be addressed together. Our objective is to
be able to write a single query that will extract data from several
databases simultaneously and in an intelligent manner. This means
identifying the individual databases where specific data exist,
parsing the query to address the right databases, and reassembling the
result in such a way that, from the client point of view, the process
is indistinguishable from a single query against a monolithic
database.

The technical literature reveals a large number of attempts to
federate databases using different methods and approaches [6-14].
This activity was very pronounced during the period around 1990-1994,
but nearly all of those projects, including a promising effort called
Pegasus at Hewlett Packard [13], seemed to disappear during the
ensuing 5 years. In particular, the Pegasus project was to have users
add remote schemas to be imported into the Pegasus database, thus
making it a dynamic federated database. Non-object-oriented schemas
were mapped to object-oriented representations within the global
database. The global access language HOSQL (Heterogeneous Object SQL)
had features of a multidatabase language system; however, local users
were responsible for integrating imported schemas. All traces of this
work vanished after 1993. Other notable approaches, including a system
called MARGBench that originated at the Otto-von-Guericke University
in Magdeburg, Germany [7], are discussed in the references.

The present work builds on previous efforts in our group to develop a
common access technology between different databases. This approach
came from the realization that writing and maintaining unique access
software for each individual database would make the concept of
federated databases nearly unworkable. The mechanism that was
developed is called the Class Mapper [15-17]. The Class Mapper is a
software layer that represents its underlying database to the
federation as an object-oriented schema (a ClassMap) that completely
describes the functionality of the underlying database. The Class
Mapper also supplies the procedures to query the database in its own
language and return results to the federation. Data exchange between
the Class Mapper and the federation can then be carried out using
reusable tools, and queries of one database appear to be operationally
identical to the queries of any other database in the federation.
Transport protocols can then be built on TCP/IP using standards such
as JDBC or CORBA. XML can also be used to define the data packages
exchanged; this is discussed elsewhere [15].
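
As an illustration only, the role of a Class Mapper can be sketched in
a few lines. This is not the implementation described in [15-17]:
Python and its standard sqlite3 module stand in for the Java/JDBC
stack, and the names `ClassMapper`, `class_map`, `fetch`, the `genes`
table, and the sample row are all hypothetical.

```python
import sqlite3


class ClassMapper:
    """Hypothetical wrapper that presents one database to the federation."""

    def __init__(self, name, connection):
        self.name = name
        self._conn = connection

    def class_map(self):
        """Return {table: [column, ...]} describing the underlying database."""
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        schema = {}
        for (table,) in tables:
            # PRAGMA table_info yields one row per column; index 1 is its name.
            columns = self._conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema

    def fetch(self, table, columns):
        """Query the database in its own dialect and return plain rows."""
        known = self.class_map()
        if table not in known or not set(columns) <= set(known[table]):
            raise KeyError(f"{self.name} does not expose {table} {columns}")
        column_list = ", ".join(f'"{c}"' for c in columns)
        return self._conn.execute(f'SELECT {column_list} FROM "{table}"').fetchall()


# Toy database standing in for one member of the federation (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (accession_id TEXT, symbol TEXT)")
conn.execute("INSERT INTO genes VALUES ('X00001', 'GENE_A')")

mapper = ClassMapper("hgdb_lite", conn)
schema = mapper.class_map()
rows = mapper.fetch("genes", ["accession_id"])
```

The point of the sketch is that the caller sees only `class_map` and
`fetch`; everything specific to the underlying database (here, the
SQLite catalog queries) stays inside the wrapper.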

II. The Federation Platform

The database federation architecture consists of a client, a
Federation Platform with a local database, and two or more external
databases that contain information used to compose an answer to the
query from the client. Fig. 1 presents an overview of the transaction
architecture. We have used the problem of federating the GATC database
and the HGDB database in this example. Numbers in circles indicate the
steps in processing a query that originates at the client at point 1.
The Federation Platform is designed to parse the query, communicate
with the Class Mappers on the individual databases, and create and
maintain a transient database (the Local Database in Fig. 1) to
efficiently extract specific information to satisfy the query.


Fig. 1. The transaction architecture in a federated database.


Steps 2 and 3 represent separate queries to the GATC and HGDB
databases to retrieve individual parts of the information that will be
used to satisfy the query in step 1. What is returned to the
Federation Platform is a series of partial tables from the GATC and
the HGDB. These partial tables are then used in step 4 to construct a
Local Database that contains only information relevant to the current
query. Because these tables are written back to the Federation
Platform on every query, the data used to answer the query are always
current. The actual query and other processing algorithms are run
against the Local Database in step 5, returning the result in step 6.
At the end of the query, the information in the Local Database is
discarded.
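
The flow described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the authors' code:
in-memory sqlite3 databases stand in for the GATC database, the HGDB,
and the Local Database, and the table names (`chip_results`, `genes`),
column names, and values are invented for the example.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Build a toy in-memory database standing in for a remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Invented stand-ins for the GATC and HGDB databases.
gatc = make_source(
    "CREATE TABLE chip_results (accession_id TEXT, intensity REAL)",
    "INSERT INTO chip_results VALUES (?, ?)",
    [("A1", 812.0), ("A2", 95.5), ("A3", 430.0)],
)
hgdb = make_source(
    "CREATE TABLE genes (accession_id TEXT, chromosome TEXT)",
    "INSERT INTO genes VALUES (?, ?)",
    [("A1", "7"), ("A2", "X"), ("A4", "12")],
)


def run_federated_query(final_sql):
    # Separate queries to each source return partial tables.
    chip_rows = gatc.execute(
        "SELECT accession_id, intensity FROM chip_results"
    ).fetchall()
    gene_rows = hgdb.execute(
        "SELECT accession_id, chromosome FROM genes"
    ).fetchall()

    # Build the transient Local Database from the partial tables.
    local = sqlite3.connect(":memory:")
    try:
        local.execute("CREATE TABLE chip_results (accession_id TEXT, intensity REAL)")
        local.execute("CREATE TABLE genes (accession_id TEXT, chromosome TEXT)")
        local.executemany("INSERT INTO chip_results VALUES (?, ?)", chip_rows)
        local.executemany("INSERT INTO genes VALUES (?, ?)", gene_rows)
        # Run the client's query against the Local Database only.
        return local.execute(final_sql).fetchall()
    finally:
        # Discard the Local Database once the result has been produced.
        local.close()


# The client's single query, written as if against one monolithic database.
result = run_federated_query(
    "SELECT c.accession_id, g.chromosome, c.intensity "
    "FROM chip_results AS c JOIN genes AS g "
    "ON c.accession_id = g.accession_id "
    "WHERE c.intensity > 100 ORDER BY c.accession_id"
)
```

Because the Local Database is rebuilt inside every call, a change in
either source is visible to the next query, at the cost of
transferring the partial tables again.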

Using a Local Database makes all of the results retrieved from the
remote databases easily accessible through standard database tools
that are designed for such purposes. Storage scalability is a second
advantage. Database systems are built to be scalable to terabytes with
good performance. With good commercial database technology, one can
scale to multi-gigabyte file queries without severe performance
penalties. Without using database functionality, such file sizes could
quickly exceed the capacity of system memory, forcing the use of
brute-force memory swapping to handle complex queries. Finally, the
Local Database can take advantage of built-in database management
features such as query optimization, security, recovery, and data
integrity.

Although we have yet to implement it, one could also make use of local
cache mechanisms to allow partial tables from the external databases
to be saved between queries so that building sequential Local
Databases from the same partial tables could be accelerated. A
mechanism much like the cache control used on internet browsers could
be adopted. This could also reduce network traffic in cases where the
partial tables were large and/or the network connections were slow.
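
One possible shape for such a cache is sketched below. It is purely
hypothetical, since the mechanism was not implemented: it assumes that
each source can cheaply report a version tag for a table (in the
spirit of the validators browsers use), and the names
`PartialTableCache`, `get`, and `load_genes` are invented.

```python
class PartialTableCache:
    """Hypothetical cache keyed by (database, table), validated by a version tag."""

    def __init__(self):
        self._entries = {}  # (database, table) -> (version, rows)
        self.hits = 0
        self.misses = 0

    def get(self, database, table, current_version, load_rows):
        """Return cached rows if the source's version tag is unchanged."""
        key = (database, table)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == current_version:
            self.hits += 1
            return cached[1]
        # Missing or stale entry: reload from the source and remember the tag.
        self.misses += 1
        rows = load_rows()
        self._entries[key] = (current_version, rows)
        return rows


cache = PartialTableCache()
transfers = []  # records each simulated transfer of a partial table


def load_genes():
    transfers.append("genes")
    return [("A1", "7"), ("A2", "X")]


first = cache.get("hgdb", "genes", "v1", load_genes)   # miss: table is transferred
second = cache.get("hgdb", "genes", "v1", load_genes)  # hit: no transfer needed
third = cache.get("hgdb", "genes", "v2", load_genes)   # tag changed: reload
```

The trade-off is the one the text points to: a cached partial table
saves a transfer only as long as the validation step is much cheaper
than refetching the table.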

III. Implementation

Our first GATC-HGDB database federation was implemented on a Sun 450
running Solaris 7 using local disks and separate databases
representing the GATC and HGDB databases. We built all three databases
(the GATC database, the Local Database, and the representative HGDB
database) using Informix Dynamic Server 2000. For the HGDB database,
we did not use the entire database at the NIH because it is over a
terabyte in size. The difficulties of replicating and keeping current
such a large local database would have detracted from our main goal of
testing feasibility. A partial HGDB database (we call it "HGDB Lite")
was therefore built locally to handle information specifically
relevant to the test queries that we were writing. It was built to
conform to the full HGDB Class Map as defined by Chuang [17]; we
simply populated a small fraction of the available tables.

Class Mappers were available for both the GATC and the complete HGDB
databases from the work of William Chuang [17]. The GATC database was
populated with about 500 sets of Affymetrix gene chip data that were
kindly supplied by Michael Cardone of the MIT Department of Biology.
No Class Mapper was required for the Local Database because it was
created on the fly by the Federation Platform itself and therefore was
completely known to the program. The "HGDB Lite" had about two dozen
tables and roughly a GByte of representative data that covered a range
of queries related to GATC information.

The DNA identifiers (called Accession IDs) in the GATC database
correspond to DNA segments with genomic characteristics. However,
these characteristics are stored in the tables and connections of the
HGDB. In order to associate a particular DNA segment from an
experiment with its genomic characteristics, the two databases must be
used in parallel. The problem then becomes the task of querying data
across the two database domains.

Merging the two databases is not a feasible solution to the problem.
Both databases are dynamic because new experimental data are being
continually added to the local GATC database and the HGDB is itself an
evolving entity. As our prototype demonstrates, federation offers a
viable solution that can be applied to multiple databases and multiple
queries.

As might be expected from the design shown in Fig. 1, the Federation
Platform encompasses a significant amount of functionality. First, the
ClassMap for each database of the federation is stored in the
Federation Platform and used to construct a global description of the
entire federation called a ClassMapRepository. A hash table is used
for fast lookups of table-to-database mappings and conflict resolution
in case more than one database has the same table name.
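
A hash-table lookup of this kind might look like the sketch below. The
text does not say how name conflicts are resolved, so the rule used
here (an ambiguous table must be qualified as `database.table`) is an
assumption, as are the function names and the example ClassMaps.

```python
from collections import defaultdict


def build_repository(class_maps):
    """Merge per-database ClassMaps into one table -> [database, ...] lookup."""
    repository = defaultdict(list)
    for database, tables in class_maps.items():
        for table in tables:
            repository[table].append(database)
    return dict(repository)


def locate(repository, table):
    """Resolve a table reference, accepting 'database.table' to break ties."""
    if "." in table:
        database, bare = table.split(".", 1)
        if database in repository.get(bare, []):
            return database, bare
        raise KeyError(f"{database} has no table {bare}")
    owners = repository.get(table, [])
    if not owners:
        raise KeyError(f"no database in the federation has table {table}")
    if len(owners) > 1:
        # Assumed policy: refuse to guess and ask for a qualified name.
        choices = ", ".join(f"{db}.{table}" for db in owners)
        raise ValueError(f"{table} is ambiguous; use one of: {choices}")
    return owners[0], table


# Invented ClassMaps, reduced to table names for brevity.
class_maps = {
    "gatc": ["chip_results", "experiments"],
    "hgdb": ["genes", "experiments"],
}
repo = build_repository(class_maps)
genes_home = locate(repo, "genes")
experiments_home = locate(repo, "gatc.experiments")
```

Both `build_repository` and `locate` rely on ordinary dictionary
lookups, so resolving a table costs the same regardless of how many
databases join the federation.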

From this global description, a series of data structures are created:
these structures define the appropriate means of querying and
accessing any individual piece of data in the entire federation. The
details of this design can be found in the thesis by Ben Fu [18].
Original queries from the client are parsed against this global
description by the Federation Platform, and the relevant data are
automatically queried from the selected tables in the several
databases. The original query structure is then used to create the
Local Database and a final query is made against the Local Database to
produce the results that are returned to the client.
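
To make the parsing step concrete, the sketch below groups the tables
named in a query by the database that owns them. It is deliberately
naive and is not the design documented in [18]: a regular expression
stands in for a real SQL parser, and `TABLE_OWNERS`, `plan_subqueries`,
and the table names are hypothetical.

```python
import re

# Hypothetical global description: table name -> owning database.
TABLE_OWNERS = {"chip_results": "gatc", "genes": "hgdb", "maps": "hgdb"}

# Naive rule: an identifier that follows FROM or JOIN is a table name.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def plan_subqueries(sql):
    """Group the tables named in FROM/JOIN clauses by owning database."""
    plan = {}
    for match in TABLE_PATTERN.findall(sql):
        table = match.lower()
        owner = TABLE_OWNERS[table]  # raises KeyError for an unknown table
        tables = plan.setdefault(owner, [])
        if table not in tables:
            tables.append(table)
    return plan


plan = plan_subqueries(
    "SELECT c.intensity, g.chromosome "
    "FROM chip_results c JOIN genes g ON c.accession_id = g.accession_id"
)
```

Each entry of `plan` names the partial tables one source must supply;
the client's original query can then be run unchanged once those
tables exist locally.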

In this prototype, extensive use was made of JDBC to query the
databases, construct the Local Database, and handle the table objects.
This was convenient because Informix supports a Type 4 (native) JDBC
driver. The object support within Java and JDBC also simplified many
of the internal structures [19]. It is entirely consistent with this
architecture to use other methods, including ODBC and CORBA, to create
the internal data structures and carry out the communication with the
Class Mappers. We have done so on other related problems with good
success.

IV. Discussion and Conclusions

This paper has presented a new approach to database federation that we
believe will be useful in the coming decade of bioinformatics. The
amount of biological data that is being created is staggering. The
human genome is only the beginning. New databases are being
constructed for different species, for different classes of proteins,
and for molecular pathways. Coupled with these data are new
experiments that yield protein information, genetic information and
gene fragment data. The task of creating paths back and forth between
these different data sources is truly daunting.

Our short-term goal has been to pick a single pair of important
databases and demonstrate an architecture that can be used to complete
cross-database queries using standard database tools and programming
methodologies. Our results are readily generalizable to other
experimental methods and other database collections. No compromises
have been made that would restrict this architecture from scaling to
very large databases and large numbers of individual data sources. An
extended scheme is shown in Fig. 2.


Fig. 2. The general Federation Platform architecture.